People have been devising ways to illegally access others' finances for as long as payment systems have existed. The problem has grown in the modern era, because purchases can be made online with nothing more than credit card information. Even before two-step verification became common for online purchases in the United States in the 2010s, many users of American retail websites fell victim to online transaction fraud. When a data breach results in monetary theft and, as a result, the loss of customer loyalty and company reputation, it puts organisations, consumers, banks, and merchants at risk.
In 2017, 16.7 million people were victims of unauthorised card operations. Furthermore, according to the Federal Trade Commission (FTC), credit card fraud reports increased by 40% in 2017 compared to the previous year. Around 13,000 incidents were reported in California and 8,000 in Florida, the two states with the highest per capita rates of this type of crime. The amount of money at stake was projected to exceed $30 billion by 2020.
Here are some credit card fraud statistics:
Image source: https://spd.group/machine-learning/credit-card-fraud-detection/
Fraud can be committed in a variety of ways and across a wide range of industries. To reach a decision, most detection systems combine a number of fraud detection datasets to form a connected picture of both legitimate and invalid payment activity. This decision must consider the IP address, geolocation, device identification, "BIN" data, global latitude/longitude, historic transaction patterns, and the actual transaction information. In practice, this means that merchants and issuers deploy analytically based responses that use internal and external data to apply a set of business rules or analytical algorithms to detect fraud.
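As a hedged illustration of such a rules layer (all field names, thresholds, and point values here are hypothetical, not taken from any particular vendor), a merchant-side score might combine a few of these signals:

```python
# Hypothetical sketch of a rule-based risk score combining several signals.
# Field names (ip_country, bin_country, home_geo, txn_geo, ...) are illustrative.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in kilometres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 6371 * 2 * asin(sqrt(a))

def fraud_score(txn):
    """Return a crude risk score: each triggered business rule adds points."""
    score = 0
    if txn["ip_country"] != txn["bin_country"]:          # IP country vs card-issuing (BIN) country
        score += 2
    if txn["amount"] > 10 * txn["avg_historic_amount"]:  # far above this user's usual spend
        score += 3
    if haversine_km(*txn["home_geo"], *txn["txn_geo"]) > 1000:  # far from the usual location
        score += 1
    return score

txn = {"ip_country": "US", "bin_country": "GB", "amount": 950.0,
       "avg_historic_amount": 40.0, "home_geo": (40.7, -74.0), "txn_geo": (51.5, -0.1)}
print(fraud_score(txn))  # 2 + 3 + 1 = 6
```

A real system would tune the weights and thresholds from data rather than hard-coding them.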
Credit Card Fraud Detection with Machine Learning is a process of data investigation by a Data Science team and the development of a model that provides the best results in revealing and preventing fraudulent transactions. This is achieved by bringing together all meaningful features of card users’ transactions, such as Date, User Zone, Product Category, Amount, Provider, and Client’s Behavioral Patterns. The information is then run through a trained model that finds patterns and rules so that it can classify whether a transaction is fraudulent or legitimate.
Transaction cloning is a common scheme: performing transactions similar to, or exact replicas of, an original transaction. This can happen, for example, when an organization tries to collect payment from a partner multiple times by sending the same invoice to different departments.
A conventional rule-based fraud detection algorithm does not distinguish well between a fraudulent transaction and an erroneous one. For example, a user might accidentally click the submit button twice or order the same product twice. Ideally, the system can tell a fraudulent transaction apart from a transaction made in error; here machine learning techniques are more effective at separating cloned transactions caused by human error from real fraud.
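A minimal sketch of the duplicate-detection step (the field names and the five-minute window are illustrative assumptions): submissions with the same card, merchant, and amount inside a short window are flagged as likely accidental clones rather than fraud:

```python
# Illustrative sketch: flag likely accidental duplicate submissions
# (same card, merchant, and amount within a short time window).
from datetime import datetime, timedelta

def find_duplicates(transactions, window=timedelta(minutes=5)):
    """Return pairs of transactions that look like accidental double submissions."""
    txns = sorted(transactions, key=lambda t: t["time"])
    dupes = []
    for i, a in enumerate(txns):
        for b in txns[i + 1:]:
            if b["time"] - a["time"] > window:
                break  # sorted by time, so later entries are even further apart
            if (a["card"], a["merchant"], a["amount"]) == (b["card"], b["merchant"], b["amount"]):
                dupes.append((a, b))
    return dupes

log = [
    {"card": "1111", "merchant": "shop", "amount": 25.0, "time": datetime(2021, 1, 1, 12, 0, 0)},
    {"card": "1111", "merchant": "shop", "amount": 25.0, "time": datetime(2021, 1, 1, 12, 0, 3)},
    {"card": "2222", "merchant": "cafe", "amount": 9.5,  "time": datetime(2021, 1, 1, 13, 0, 0)},
]
print(len(find_duplicates(log)))  # 1: the two identical "shop" payments 3 seconds apart
```

Transactions caught by such a rule can be routed to a refund or confirmation flow instead of the fraud queue.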
When an individual's private data, such as a Social Security number, the answer to a secret question, or a date of birth, is stolen by criminals, they can use this information to carry out financial operations. Many fraudulent transactions are linked to identity theft, so financial fraud prevention systems must pay the most attention to building a profile of each user's behaviour.
Suppose there is a certain regularity in the way a customer makes payments, e.g. a person visits a certain bar once a week at the same time and usually spends about $40 to $60. If the same account is used to make a payment at a bar located in another part of town, and for a sum of more than $60, this behaviour would be considered abnormal. The next move would be to send a verification request to the card owner in order to validate that he or she made the transaction.
Metrics such as standard deviation, averages, and high/low values are the most useful for identifying abnormal behaviour. Individual payments are compared with personal benchmarks to identify transactions with a high standard deviation; the best choice is then to ask the account holder to validate the transaction when such a deviation occurs.
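The benchmark idea above can be sketched in a few lines (the history values and the three-sigma threshold are illustrative assumptions, not part of the dataset analysed later):

```python
# Sketch of a personal-benchmark check: flag a payment whose amount deviates
# from the customer's historic mean by more than k standard deviations.
import statistics

def is_abnormal(history, new_amount, k=3.0):
    """True if new_amount lies more than k standard deviations from the historic mean."""
    mean = statistics.mean(history)
    std = statistics.stdev(history)
    return abs(new_amount - mean) > k * std

bar_tabs = [42.0, 55.0, 48.0, 60.0, 51.0, 44.0]   # usual $40-$60 weekly bar spend
print(is_abnormal(bar_tabs, 52.0))   # False: within the usual range
print(is_abnormal(bar_tabs, 250.0))  # True: trigger a verification request
```

In production the benchmarks would be recomputed per customer as new transactions arrive.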
Application fraud is often accompanied by account or identity theft: someone applies to open a new credit account or credit card in a different name. First, criminals steal documents that will serve as proof for the fake application.
Anomaly detection helps determine whether a transaction has abnormal patterns, such as an unusual date and time or quantity of items. If the algorithm detects such unusual behavior, the bank account holder is protected by several verification methods.
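One common way to implement this kind of anomaly detection is an isolation forest. The sketch below is illustrative only: the features (hour of day, amount, number of items) and their distributions are made up, not taken from the dataset analysed later.

```python
# Minimal anomaly-detection sketch with scikit-learn's IsolationForest.
# Features (hour of day, amount, item count) and their distributions are made up.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Simulated normal behaviour: daytime purchases, modest amounts, 1-3 items
normal = np.column_stack([rng.normal(14, 2, 500),       # hour of day
                          rng.normal(50, 15, 500),      # amount ($)
                          rng.integers(1, 4, 500)])     # number of items
outlier = np.array([[3.0, 900.0, 30.0]])                # 3 am, $900, 30 items

model = IsolationForest(contamination=0.01, random_state=0).fit(normal)
print(model.predict(outlier))  # [-1] -> flagged as anomalous
```

A flagged prediction (`-1`) would then trigger the verification methods described above rather than an outright block.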
Credit card theft means the illegal copying of a credit or bank card using a device that reads and copies information from the original card. Fraudsters use devices called "skimmers" to extract card numbers and other credit card information, store it, and resell it to criminals.
As with identity theft, suspicious transactions made with an electronically or manually copied card show up in the transaction data. Classification techniques can be used to determine whether a transaction is fraudulent based on the equipment used, the geographic location, and information about customer behavior models.
Fraudsters can send phishing emails to cardholders. The messages appear perfectly legitimate, with a URL very similar to the bank's and a trustworthy logo, as if they were sent by the bank, and ask for online banking numbers and passwords. If you click the wrong link or provide valuable information in response to a message from a fake banking website, attackers can empty your bank account into one of their own within a few days.
To counter this fraudulent scheme, artificial intelligence solutions rely on neural networks and pattern recognition. Neural networks can learn suspicious patterns, as well as detect classes and clusters, and use these patterns to detect fraud.
Credit card fraud is commonly caused either by the cardholder's negligence with their data or by a breach in a website's security.
Here are a few examples:
If your card is lost or stolen, unauthorized debiting of funds may occur: the person who finds it uses it to make purchases. Criminals can also spoof your name and use the card details to order items through a mobile phone or computer. There is also the problem of counterfeit credit cards: fake cards carrying real account information stolen from the cardholder. This is especially dangerous because the victim still has the real card and does not know that someone has copied it. These counterfeit cards look legitimate, with a logo and a magnetic stripe matching the original. Criminals often destroy them after several successful payments, shortly before the victim notices the problem and reports it.
Detecting fraudulent transactions is of great importance to any credit card company. Our task is to detect potential fraud so that customers are not charged for items that they did not purchase.
So the goal is to build a classifier that tells whether a transaction is fraudulent or not.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import time
import plotly.graph_objs as go
import plotly.offline as py
import matplotlib.gridspec as gridspec
import scipy as sp
import warnings
warnings.filterwarnings('ignore')
credit = pd.read_csv("creditcard.csv")
credit.head()
| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.167170 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.379780 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.108300 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.50 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.206010 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |
5 rows × 31 columns
credit.shape
(284807, 31)
credit.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     284807 non-null  float64
 22  V22     284807 non-null  float64
 23  V23     284807 non-null  float64
 24  V24     284807 non-null  float64
 25  V25     284807 non-null  float64
 26  V26     284807 non-null  float64
 27  V27     284807 non-null  float64
 28  V28     284807 non-null  float64
 29  Amount  284807 non-null  float64
 30  Class   284807 non-null  int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
credit.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Time | 284807.0 | 9.481386e+04 | 47488.145955 | 0.000000 | 54201.500000 | 84692.000000 | 139320.500000 | 172792.000000 |
| V1 | 284807.0 | 3.918649e-15 | 1.958696 | -56.407510 | -0.920373 | 0.018109 | 1.315642 | 2.454930 |
| V2 | 284807.0 | 5.682686e-16 | 1.651309 | -72.715728 | -0.598550 | 0.065486 | 0.803724 | 22.057729 |
| V3 | 284807.0 | -8.761736e-15 | 1.516255 | -48.325589 | -0.890365 | 0.179846 | 1.027196 | 9.382558 |
| V4 | 284807.0 | 2.811118e-15 | 1.415869 | -5.683171 | -0.848640 | -0.019847 | 0.743341 | 16.875344 |
| V5 | 284807.0 | -1.552103e-15 | 1.380247 | -113.743307 | -0.691597 | -0.054336 | 0.611926 | 34.801666 |
| V6 | 284807.0 | 2.040130e-15 | 1.332271 | -26.160506 | -0.768296 | -0.274187 | 0.398565 | 73.301626 |
| V7 | 284807.0 | -1.698953e-15 | 1.237094 | -43.557242 | -0.554076 | 0.040103 | 0.570436 | 120.589494 |
| V8 | 284807.0 | -1.893285e-16 | 1.194353 | -73.216718 | -0.208630 | 0.022358 | 0.327346 | 20.007208 |
| V9 | 284807.0 | -3.147640e-15 | 1.098632 | -13.434066 | -0.643098 | -0.051429 | 0.597139 | 15.594995 |
| V10 | 284807.0 | 1.772925e-15 | 1.088850 | -24.588262 | -0.535426 | -0.092917 | 0.453923 | 23.745136 |
| V11 | 284807.0 | 9.289524e-16 | 1.020713 | -4.797473 | -0.762494 | -0.032757 | 0.739593 | 12.018913 |
| V12 | 284807.0 | -1.803266e-15 | 0.999201 | -18.683715 | -0.405571 | 0.140033 | 0.618238 | 7.848392 |
| V13 | 284807.0 | 1.674888e-15 | 0.995274 | -5.791881 | -0.648539 | -0.013568 | 0.662505 | 7.126883 |
| V14 | 284807.0 | 1.475621e-15 | 0.958596 | -19.214325 | -0.425574 | 0.050601 | 0.493150 | 10.526766 |
| V15 | 284807.0 | 3.501098e-15 | 0.915316 | -4.498945 | -0.582884 | 0.048072 | 0.648821 | 8.877742 |
| V16 | 284807.0 | 1.392460e-15 | 0.876253 | -14.129855 | -0.468037 | 0.066413 | 0.523296 | 17.315112 |
| V17 | 284807.0 | -7.466538e-16 | 0.849337 | -25.162799 | -0.483748 | -0.065676 | 0.399675 | 9.253526 |
| V18 | 284807.0 | 4.258754e-16 | 0.838176 | -9.498746 | -0.498850 | -0.003636 | 0.500807 | 5.041069 |
| V19 | 284807.0 | 9.019919e-16 | 0.814041 | -7.213527 | -0.456299 | 0.003735 | 0.458949 | 5.591971 |
| V20 | 284807.0 | 5.126845e-16 | 0.770925 | -54.497720 | -0.211721 | -0.062481 | 0.133041 | 39.420904 |
| V21 | 284807.0 | 1.473120e-16 | 0.734524 | -34.830382 | -0.228395 | -0.029450 | 0.186377 | 27.202839 |
| V22 | 284807.0 | 8.042109e-16 | 0.725702 | -10.933144 | -0.542350 | 0.006782 | 0.528554 | 10.503090 |
| V23 | 284807.0 | 5.282512e-16 | 0.624460 | -44.807735 | -0.161846 | -0.011193 | 0.147642 | 22.528412 |
| V24 | 284807.0 | 4.456271e-15 | 0.605647 | -2.836627 | -0.354586 | 0.040976 | 0.439527 | 4.584549 |
| V25 | 284807.0 | 1.426896e-15 | 0.521278 | -10.295397 | -0.317145 | 0.016594 | 0.350716 | 7.519589 |
| V26 | 284807.0 | 1.701640e-15 | 0.482227 | -2.604551 | -0.326984 | -0.052139 | 0.240952 | 3.517346 |
| V27 | 284807.0 | -3.662252e-16 | 0.403632 | -22.565679 | -0.070840 | 0.001342 | 0.091045 | 31.612198 |
| V28 | 284807.0 | -1.217809e-16 | 0.330083 | -15.430084 | -0.052960 | 0.011244 | 0.078280 | 33.847808 |
| Amount | 284807.0 | 8.834962e+01 | 250.120109 | 0.000000 | 5.600000 | 22.000000 | 77.165000 | 25691.160000 |
| Class | 284807.0 | 1.727486e-03 | 0.041527 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
credit.nunique()
Time      124592
V1        275663
V2        275663
V3        275663
V4        275663
V5        275663
V6        275663
V7        275663
V8        275663
V9        275663
V10       275663
V11       275663
V12       275663
V13       275663
V14       275663
V15       275663
V16       275663
V17       275663
V18       275663
V19       275663
V20       275663
V21       275663
V22       275663
V23       275663
V24       275663
V25       275663
V26       275663
V27       275663
V28       275663
Amount     32767
Class          2
dtype: int64
credit.isnull().sum()
Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64
credit.hist(bins = 100, figsize = (25, 30))
plt.suptitle('Histograms of Numerical Columns', fontsize = 35)
plt.show()
print('Number of fraudulent transactions = %d or %d per 100,000 transactions in the dataset'
%(len(credit[credit.Class == 1]), len(credit[credit.Class == 1])/len(credit)*100000))
Number of fraudulent transactions = 492 or 172 per 100,000 transactions in the dataset
fig, ax = plt.subplots(ncols = 4, nrows = 8, figsize = (25,30))
index = 0
ax = ax.flatten()
for col, value in credit.items():
    if col != 'type':
        sns.boxplot(y = col, data = credit, ax = ax[index])
        index += 1
plt.tight_layout(pad = 0.5, w_pad = 0.7, h_pad = 5.0)
fig, ax = plt.subplots(ncols = 4, nrows = 8, figsize = (25,30))
index = 0
ax = ax.flatten()
for col, value in credit.items():
    if col != 'type':
        sns.distplot(value, ax = ax[index])
        index += 1
plt.tight_layout(pad = 0.5, w_pad = 0.7, h_pad = 5.0)
credit["Class"].value_counts()
0    284315
1       492
Name: Class, dtype: int64
sns.set_style("whitegrid")
plt.figure(figsize = (18,5))
sns.countplot(x = 'Class', data = credit)
plt.title('Number of Frauds & Normal transactions', fontsize = 20)
Text(0.5, 1.0, 'Number of Frauds & Normal transactions')
temp = credit["Class"].value_counts()
labels = temp.index
sizes = (temp / temp.sum())*100
trace = go.Pie(labels = labels, values = sizes, hoverinfo = 'label+percent')
layout = go.Layout(title = 'Class %')
data = [trace]
fig = go.Figure(data = data, layout = layout)
py.iplot(fig, filename = "Class")
Only 492 out of 284807 are fraud.
0.173% are fraud.
frauds = credit[credit.Class == 1]
normal = credit[credit.Class == 0]
fig, (ax1, ax2) = plt.subplots(2, 1, sharex = True, figsize = (18,8))
fig.suptitle('Amount per transaction by class (Fraud / normal)', fontsize = 20)
bins = 50
ax1.hist(frauds.Amount, bins = bins)
ax1.set_title('Fraud', fontsize = 20)
ax2.hist(normal.Amount, bins = bins)
ax2.set_title('Normal', fontsize = 20)
plt.xlabel('Amount ($)', fontsize = 18)
plt.ylabel('Number of Transactions', fontsize = 18)
plt.xlim((0, 20000))
plt.yscale('log')
plt.show();
fig, (ax1, ax2) = plt.subplots(2, 1, sharex = True, figsize = (18,8))
fig.suptitle('Time of transaction vs Amount by class', fontsize = 20)
ax1.scatter(frauds.Time, frauds.Amount)
ax1.set_title('Fraud', fontsize = 20)
ax2.scatter(normal.Time, normal.Amount)
ax2.set_title('Normal', fontsize = 20)
plt.xlabel('Time (in Seconds)', fontsize = 20)
plt.ylabel('Amount', fontsize = 20)
plt.show()
features = credit.iloc[:,1:29].columns
plt.figure(figsize = (12,28*4))
gs = gridspec.GridSpec(28, 1)
for i, cn in enumerate(credit[features]):
    ax = plt.subplot(gs[i])
    sns.distplot(credit[cn][credit.Class == 1], bins = 50)
    sns.distplot(credit[cn][credit.Class == 0], bins = 50)
    ax.set_xlabel('')
    ax.set_title('histogram of feature: ' + str(cn), fontsize = 20)
plt.show()
plt.figure(figsize = (18,6))
ax = sns.distplot(frauds['Time'], label = 'fraudulent', hist = False)
ax = sns.distplot(normal['Time'], label = 'non fraudulent', hist = False)
ax.set(xlabel = 'Seconds elapsed between the transaction and the first transaction')
plt.show()
plt.figure(figsize = (15,5))
sns.scatterplot(x = credit["Amount"], y = credit["Class"])
plt.title("Amount vs Class scatter plot", fontsize = 20)
plt.show()
print("Amount details of the fraudulent transaction")
frauds.Amount.describe()
Amount details of the fraudulent transaction
count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64
print("Amount details of the normal transaction")
normal.Amount.describe()
Amount details of the normal transaction
count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64
X = credit.drop(labels = 'Class', axis = 1) # Features
y = credit.loc[:,'Class'] # Response
del credit
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1, stratify = y)
del X, y
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
(227845, 30) (227845,) (56962, 30) (56962,)
# Make explicit copies to avoid SettingWithCopyWarning on the slices below
x_train = x_train.copy()
x_test = x_test.copy()
x_train['Time'].describe()
count    227845.000000
mean      94707.617670
std       47523.204111
min           0.000000
25%       54086.000000
50%       84609.000000
75%      139261.000000
max      172792.000000
Name: Time, dtype: float64
x_train.loc[:,'Time'] = x_train.Time / 3600
x_test.loc[:,'Time'] = x_test.Time / 3600
x_train['Time'].max() / 24
1.9999074074074075
plt.figure(figsize = (15,5), dpi = 80)
sns.distplot(x_train['Time'], bins = 48, kde = False)
plt.xlim([0,48])
plt.xticks(np.arange(0, 54, 6))
plt.xlabel('Time After First Transaction (hr)', fontsize = 20)
plt.ylabel('Count', fontsize = 20)
plt.title('Transaction Times', fontsize = 20)
Text(0.5, 1.0, 'Transaction Times')
x_train['Amount'].describe()
count    227845.000000
mean         88.709296
std         250.026305
min           0.000000
25%           5.550000
50%          22.000000
75%          77.890000
max       25691.160000
Name: Amount, dtype: float64
plt.figure(figsize = (15,5), dpi = 80)
sns.distplot(x_train['Amount'], bins = 300, kde = False)
plt.ylabel('Count', fontsize = 20)
plt.title('Transaction Amounts', fontsize = 20)
Text(0.5, 1.0, 'Transaction Amounts')
plt.figure(figsize = (15,5), dpi = 80)
sns.boxplot(x_train['Amount'])
plt.title('Transaction Amounts', fontsize = 20)
Text(0.5, 1.0, 'Transaction Amounts')
x_train['Amount'].skew()
16.910303546516744
x_train.loc[:,'Amount'] = x_train['Amount'] + 1e-9 # Shift amounts by 1e-9: Box-Cox requires strictly positive values
x_train.loc[:,'Amount'], maxlog, (min_ci, max_ci) = sp.stats.boxcox(x_train['Amount'], alpha = 0.01)
maxlog
0.1343656979074871
(min_ci, max_ci)
(0.13291390124731134, 0.1358266545085327)
plt.figure(figsize = (15,5), dpi = 80)
sns.distplot(x_train['Amount'], kde = False)
plt.xlabel('Transformed Amount', fontsize = 20)
plt.ylabel('Count', fontsize = 20)
plt.title('Transaction Amounts (Box-Cox Transformed)', fontsize = 20)
Text(0.5, 1.0, 'Transaction Amounts (Box-Cox Transformed)')
x_train['Amount'].describe()
count    227845.000000
mean          3.985515
std           2.972505
min          -6.982733
25%           1.927181
50%           3.831861
75%           5.919328
max          21.680567
Name: Amount, dtype: float64
x_train['Amount'].skew()
0.11421488033443958
x_test.loc[:,'Amount'] = x_test['Amount'] + 1e-9 # Shift amounts by 1e-9: Box-Cox requires strictly positive values
x_test.loc[:,'Amount'] = sp.stats.boxcox(x_test['Amount'], lmbda = maxlog)
#sns.jointplot(x_train['Time'].apply(lambda x: x % 24), x_train['Amount'], kind = 'hex', stat_func = None, size = 12, xlim = (0,24), ylim = (-7.5,14)).set_axis_labels('Time of Day (hr)', 'Transformed Amount')
pca_vars = ['V%i' % k for k in range(1,29)]
x_train[pca_vars].describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| V1 | 227845.0 | -0.000713 | 1.952399 | -56.407510 | -0.922830 | 0.016743 | 1.315147 | 2.451888 |
| V2 | 227845.0 | -0.001034 | 1.636689 | -72.715728 | -0.599928 | 0.064370 | 0.801738 | 22.057729 |
| V3 | 227845.0 | 0.002557 | 1.514288 | -48.325589 | -0.887861 | 0.180865 | 1.027592 | 9.382558 |
| V4 | 227845.0 | 0.003839 | 1.417086 | -5.683171 | -0.844052 | -0.016750 | 0.746907 | 16.715537 |
| V5 | 227845.0 | -0.002857 | 1.383532 | -113.743307 | -0.693702 | -0.055388 | 0.611056 | 34.801666 |
| V6 | 227845.0 | 0.002085 | 1.333769 | -26.160506 | -0.766195 | -0.271706 | 0.401204 | 73.301626 |
| V7 | 227845.0 | 0.000022 | 1.240239 | -43.557242 | -0.555377 | 0.039185 | 0.569307 | 120.589494 |
| V8 | 227845.0 | 0.000093 | 1.200348 | -73.216718 | -0.208302 | 0.022594 | 0.328079 | 20.007208 |
| V9 | 227845.0 | 0.000243 | 1.096453 | -13.434066 | -0.642006 | -0.051224 | 0.596563 | 15.594995 |
| V10 | 227845.0 | -0.000363 | 1.082580 | -24.588262 | -0.535079 | -0.091877 | 0.455577 | 23.745136 |
| V11 | 227845.0 | 0.000651 | 1.020932 | -4.797473 | -0.761255 | -0.032213 | 0.740921 | 12.018913 |
| V12 | 227845.0 | -0.000834 | 1.000266 | -18.431131 | -0.406597 | 0.141227 | 0.617925 | 7.848392 |
| V13 | 227845.0 | -0.000976 | 0.996678 | -5.791881 | -0.650225 | -0.014920 | 0.663521 | 7.126883 |
| V14 | 227845.0 | 0.002291 | 0.957485 | -19.214325 | -0.423563 | 0.051939 | 0.495181 | 10.526766 |
| V15 | 227845.0 | -0.000595 | 0.916946 | -4.498945 | -0.584357 | 0.047181 | 0.649788 | 8.877742 |
| V16 | 227845.0 | -0.000499 | 0.876978 | -14.129855 | -0.469513 | 0.066080 | 0.522836 | 17.315112 |
| V17 | 227845.0 | 0.000587 | 0.846748 | -25.162799 | -0.484153 | -0.065260 | 0.400067 | 9.207059 |
| V18 | 227845.0 | 0.001448 | 0.838169 | -9.335193 | -0.497660 | -0.001867 | 0.501554 | 5.041069 |
| V19 | 227845.0 | -0.000146 | 0.815140 | -7.213527 | -0.456946 | 0.004371 | 0.459695 | 5.591971 |
| V20 | 227845.0 | -0.000796 | 0.767956 | -54.497720 | -0.212129 | -0.062809 | 0.132873 | 39.420904 |
| V21 | 227845.0 | 0.000023 | 0.733325 | -34.830382 | -0.227952 | -0.029095 | 0.186678 | 27.202839 |
| V22 | 227845.0 | 0.000897 | 0.725353 | -10.933144 | -0.541079 | 0.007661 | 0.529342 | 10.503090 |
| V23 | 227845.0 | 0.000765 | 0.616772 | -36.666000 | -0.162242 | -0.011184 | 0.147825 | 22.083545 |
| V24 | 227845.0 | -0.000380 | 0.605741 | -2.836627 | -0.354208 | 0.040977 | 0.438432 | 4.584549 |
| V25 | 227845.0 | 0.000136 | 0.522184 | -8.696627 | -0.317488 | 0.017081 | 0.351400 | 7.519589 |
| V26 | 227845.0 | -0.000301 | 0.482187 | -2.534330 | -0.327576 | -0.052599 | 0.240554 | 3.517346 |
| V27 | 227845.0 | -0.000046 | 0.405182 | -22.565679 | -0.070767 | 0.001373 | 0.091028 | 31.612198 |
| V28 | 227845.0 | -0.000027 | 0.331048 | -15.430084 | -0.053013 | 0.011255 | 0.078341 | 33.847808 |
plt.figure(figsize = (15,5), dpi = 80)
sns.barplot(x = pca_vars, y = x_train[pca_vars].mean(), color = 'darkblue')
plt.xlabel('Column', fontsize = 20)
plt.ylabel('Mean', fontsize = 20)
plt.title('V1-V28 Means', fontsize = 20)
Text(0.5, 1.0, 'V1-V28 Means')
plt.figure(figsize = (15,5), dpi = 80)
sns.barplot(x = pca_vars, y = x_train[pca_vars].std(), color = 'darkred')
plt.xlabel('Column', fontsize = 20)
plt.ylabel('Standard Deviation', fontsize = 20)
plt.title('V1-V28 Standard Deviations', fontsize = 20)
Text(0.5, 1.0, 'V1-V28 Standard Deviations')
Plot the skewnesses next:
plt.figure(figsize = (15,5), dpi = 80)
sns.barplot(x = pca_vars, y = x_train[pca_vars].skew(), color = 'darkgreen')
plt.xlabel('Column', fontsize = 20)
plt.ylabel('Skewness', fontsize = 20)
plt.title('V1-V28 Skewnesses', fontsize = 20)
Text(0.5, 1.0, 'V1-V28 Skewnesses')
plt.figure(figsize = (15,5), dpi = 80)
sns.distplot(x_train['V8'], bins = 300, kde = False)
plt.ylabel('Count', fontsize = 20)
plt.title('V8', fontsize = 20)
Text(0.5, 1.0, 'V8')
The histogram doesn't clearly show the outliers.
Let's try a boxplot:
plt.figure(figsize = (15,5), dpi = 80)
sns.boxplot(x_train['V8'])
plt.title('V8', fontsize = 20)
Text(0.5, 1.0, 'V8')
The kurtosis method employed in pandas is Fisher’s definition, for which the standard normal distribution has kurtosis 0.
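As a quick check of this convention (on a synthetic sample, not the credit card data), a large standard normal sample should have kurtosis near 0 under pandas' definition:

```python
# Verify Fisher's convention: a standard normal sample has excess kurtosis ~ 0.
import numpy as np
import pandas as pd

s = pd.Series(np.random.default_rng(1).normal(size=100_000))
print(s.kurtosis())  # near 0 for a normal sample (Fisher's definition)
```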
Note the log scale on the y-axis in the plot below:
plt.figure(figsize = (15,5), dpi = 80)
plt.yscale('log')
sns.barplot(x = pca_vars, y = x_train[pca_vars].kurtosis(), color = 'darkorange')
plt.xlabel('Column', fontsize = 20)
plt.ylabel('Kurtosis', fontsize = 20)
plt.title('V1-V28 Kurtoses', fontsize = 20)
Text(0.5, 1.0, 'V1-V28 Kurtoses')
Let's plot the medians:
plt.figure(figsize = (15,5), dpi = 80)
sns.barplot(x = pca_vars, y = x_train[pca_vars].median(), color = 'darkblue')
plt.xlabel('Column', fontsize = 20)
plt.ylabel('Median', fontsize = 20)
plt.title('V1-V28 Medians', fontsize = 20)
Text(0.5, 1.0, 'V1-V28 Medians')
The medians are also roughly zero.
Next let's look at the interquartile ranges (IQR):
Pandas does not have a built-in IQR method, but we can use the quantile method to calculate the IQR.
plt.figure(figsize = (15,5), dpi = 80)
sns.barplot(x = pca_vars, y = x_train[pca_vars].quantile(0.75) - x_train[pca_vars].quantile(0.25), color = 'darkred')
plt.xlabel('Column', fontsize = 20)
plt.ylabel('IQR', fontsize = 20)
plt.title('V1-V28 IQRs', fontsize = 20)
Text(0.5, 1.0, 'V1-V28 IQRs')
Mutual information is a non-parametric method to estimate the mutual dependence between two variables. Mutual information of 0 indicates no dependence, and higher values indicate higher dependence.
According to the sklearn User Guide, "mutual information methods can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation."
We have 227,845 training samples, so mutual information should work well. Because the target variable is discrete, we use mutual_info_classif (as opposed to mutual_info_regression for a continuous target).
from sklearn.feature_selection import mutual_info_classif
mutual_infos = pd.Series(data = mutual_info_classif(x_train, y_train, discrete_features = False, random_state = 1),
                         index = x_train.columns)
mutual_infos.sort_values(ascending = False)
V17       0.008037
V14       0.007977
V10       0.007354
V12       0.007354
V11       0.006607
V16       0.005793
V4        0.004843
V3        0.004755
V18       0.004025
V9        0.003996
V7        0.003941
V2        0.003085
V21       0.002304
V27       0.002271
V6        0.002265
V5        0.002254
V1        0.001990
V8        0.001843
V28       0.001757
Time      0.001722
Amount    0.001422
V19       0.001322
V20       0.001136
V23       0.000827
V24       0.000593
V26       0.000459
V22       0.000388
V25       0.000376
V15       0.000230
V13       0.000205
dtype: float64
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
pipeline_sgd = Pipeline([
    ('scaler', StandardScaler(copy = False)),
    ('model', SGDClassifier(max_iter = 1000, tol = 1e-3, random_state = 1, warm_start = True))
])
param_grid_sgd = [{
    'model__loss': ['log'],
    'model__penalty': ['l1', 'l2'],
    'model__alpha': np.logspace(start = -3, stop = 3, num = 20)
}, {
    'model__loss': ['hinge'],
    'model__alpha': np.logspace(start = -3, stop = 3, num = 20),
    'model__class_weight': [None, 'balanced']
}]
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, matthews_corrcoef
MCC_scorer = make_scorer(matthews_corrcoef)
grid_sgd = GridSearchCV(estimator = pipeline_sgd,
                        param_grid = param_grid_sgd,
                        scoring = MCC_scorer,
                        n_jobs = -1,
                        pre_dispatch = '2*n_jobs',
                        cv = 5,
                        verbose = 1,
                        return_train_score = False)
import warnings
with warnings.catch_warnings(): # Suppress warnings from the matthews_corrcoef function
    warnings.simplefilter("ignore")
    grid_sgd.fit(x_train, y_train)
Fitting 5 folds for each of 80 candidates, totalling 400 fits
grid_sgd.best_score_
0.8054381462050987
This is a fairly good MCC score: random guessing has an expected score of 0, and a perfect predictor scores 1.
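As a quick sanity check of that scale (using made-up labels, not the dataset), scikit-learn's `matthews_corrcoef` returns 1 for a perfect predictor and 0 for a constant majority-class predictor:

```python
# Toy illustration of the MCC scale on imbalanced labels.
from sklearn.metrics import matthews_corrcoef

y_true = [0, 0, 0, 0, 0, 0, 1, 1]           # imbalanced toy labels
print(matthews_corrcoef(y_true, y_true))     # 1.0: perfect predictions
print(matthews_corrcoef(y_true, [0] * 8))    # 0.0: always predicts the majority class
```

Note that a constant predictor here still has 75% accuracy, which is why MCC is a more honest metric than accuracy on imbalanced data like ours.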
Now check the best hyperparameters found in the grid search:
grid_sgd.best_params_
{'model__alpha': 233.57214690901213,
'model__class_weight': 'balanced',
'model__loss': 'hinge'}
from sklearn.ensemble import RandomForestClassifier
pipeline_rf = Pipeline([
    ('model', RandomForestClassifier(n_jobs = -1, random_state = 1))
])
param_grid_rf = {'model__n_estimators': [75]}
grid_rf = GridSearchCV(estimator = pipeline_rf,
                       param_grid = param_grid_rf,
                       scoring = MCC_scorer,
                       n_jobs = -1,
                       pre_dispatch = '2*n_jobs',
                       cv = 5,
                       verbose = 1,
                       return_train_score = False)
grid_rf.fit(x_train, y_train)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
GridSearchCV(cv=5,
estimator=Pipeline(steps=[('model',
RandomForestClassifier(n_jobs=-1,
random_state=1))]),
n_jobs=-1, param_grid={'model__n_estimators': [75]},
scoring=make_scorer(matthews_corrcoef), verbose=1)
grid_rf.best_score_
0.8596447282953857
grid_rf.best_params_
{'model__n_estimators': 75}
from sklearn.metrics import confusion_matrix, classification_report, matthews_corrcoef, cohen_kappa_score, accuracy_score, average_precision_score, roc_auc_score
def classification_eval(estimator, x_test, y_test):
    """
    Print several metrics of classification performance of an estimator,
    given features x_test and true labels y_test.
    Input: estimator or GridSearchCV instance, x_test, y_test
    Returns: text printout of metrics
    """
    y_pred = estimator.predict(x_test)
    # Number of decimal places based on number of samples
    dec = np.int64(np.ceil(np.log10(len(y_test))))
    print('CONFUSION MATRIX')
    print(confusion_matrix(y_test, y_pred), '\n')
    print('CLASSIFICATION REPORT')
    print(classification_report(y_test, y_pred, digits = dec))
    print('SCALAR METRICS')
    format_str = '%%13s = %%.%if' % dec
    print(format_str % ('MCC', matthews_corrcoef(y_test, y_pred)))
    if y_test.nunique() <= 2: # Additional metrics for binary classification
        try:
            y_score = estimator.predict_proba(x_test)[:,1]
        except AttributeError: # Estimators without predict_proba (e.g. hinge-loss SGD)
            y_score = estimator.decision_function(x_test)
        print(format_str % ('AUPRC', average_precision_score(y_test, y_score)))
        print(format_str % ('AUROC', roc_auc_score(y_test, y_score)))
    print(format_str % ("Cohen's kappa", cohen_kappa_score(y_test, y_pred)))
    print(format_str % ('Accuracy', accuracy_score(y_test, y_pred)))
classification_eval(grid_rf, x_test, y_test)
CONFUSION MATRIX
[[56854 10]
[ 15 83]]
CLASSIFICATION REPORT
precision recall f1-score support
0 0.99974 0.99982 0.99978 56864
1 0.89247 0.84694 0.86911 98
accuracy 0.99956 56962
macro avg 0.94610 0.92338 0.93445 56962
weighted avg 0.99955 0.99956 0.99956 56962
SCALAR METRICS
MCC = 0.86919
AUPRC = 0.85098
AUROC = 0.95924
Cohen's kappa = 0.86889
Accuracy = 0.99956
We found that the five variables most strongly associated with fraud (by mutual information) are, in decreasing order, V17, V14, V10, V12, and V11. Only a few preprocessing steps were necessary before constructing predictive models: